A Spatial Layout and Scale Invariant Feature 
Representation for Indoor Scene Classification 

M. Hayat, S. H. Khan, M. Bennamoun, and S. An


Abstract —Unlike standard object classification, where 
the image to be classified contains one or multiple instances 
of the same object, indoor scene classification is quite 
different since the image consists of multiple distinct 
objects. Further, these objects can be of varying sizes and 
are present across numerous spatial locations in different 
layouts. For automatic indoor scene categorization, large 
scale spatial layout deformations and scale variations are 
therefore two major challenges and the design of rich 
feature descriptors which are robust to these challenges 
is still an open problem. This paper introduces a new 
learnable feature descriptor called “spatial layout and 
scale invariant convolutional activations” to deal with these 
challenges. For this purpose, a new Convolutional Neural 
Network architecture is designed which incorporates a 
novel ‘Spatially Unstructured’ layer to introduce robustness 
against spatial layout deformations. To achieve scale 
invariance, we present a pyramidal image representation. 
For feasible training of the proposed network on images 
of indoor scenes, the paper proposes a new methodology 
which efficiently adapts a network model trained on 
large scale data to our task with only a limited amount of 
available training data. Compared with the existing state of the 
art, the proposed approach achieves a relative performance 
improvement of 3.2%, 3.8%, 7.0%, 11.9% and 2.1% on 
MIT-67, Scene-15, Sports-8, Graz-02 and NYU datasets 
respectively. 

Index Terms —Indoor Scenes Classification, Spatial Layout 
Variations, Scale Invariance 


I. Introduction 

Recognition/classification is an important computer 
vision problem and has gained significant research attention 
over the last few decades. Most of the efforts in this 
regard have been tailored towards generic object recognition 
(an image with one or multiple instances of the same 
object) and face recognition (an image with the face 
region of a person). Unlike these classification tasks, 
indoor scene classification is quite different since an image 
of an indoor scene contains multiple distinct objects, 
with different scales and sizes, laid across different 
spatial locations in a number of possible layouts. Due to 
the challenging nature of the problem, the state of the 
art performance for indoor scene classification is much 
lower (69% classification accuracy on the MIT-67 dataset 
with only 67 classes [7]) compared with other classification 
tasks such as object classification (94% rank-5 
identification rate on the ImageNet database with 1000 
object categories [36]) and face recognition (human level 
performance on real life datasets 
including Labeled Faces in the Wild and YouTube Faces 
[39]). This paper proposes a novel method of feature 
description, specifically tailored for indoor scene images, 
in order to address the challenges of large scale spatial 
layout deformations and scale variations. 

We can characterize some indoor scenes by only 
global spatial information [26], [31], whereas for others, 
local appearance information [5], [16], [23] is more 
critical. For example, a corridor can be predominantly 
characterized by a single large object (walls) whereas 
a bedroom scene is characterized by multiple objects 
(e.g., sofa, bed, table). Both global and local spatial 
information must therefore be leveraged in order to 
accommodate different scene types [30]. This, however, is 
very challenging for two main reasons. First, the spatial 
scale of the constituent objects varies significantly across 
different scene types. Second, the constituent objects can 
be present in different spatial locations and in a number 
of possible layouts. This is demonstrated in the example 
images of the kitchen scene in Fig. 1, where a microwave 
can be present in many different locations in the image 
with significant variations in scale, pose and appearance. 

This paper aims to achieve invariance with respect to 
the spatial layout and the scale of the constituent objects 
in indoor scene images. To achieve invariance with respect 
to the spatial scale of objects, we generate a pyramidal 
image representation where an image is resized to different 
scales, and features are computed across these scales 
(Sec. III-C). To achieve spatial layout invariance, we 
introduce a new method of feature description which is 
based on a proposed modified Convolutional Neural Network 
(CNN) architecture (Sec. III-A). 

Fig. 1: The spatial structure of indoor scenes is loose, 
irregular and unpredictable, which can confuse the 
classification system. As an example, a microwave in a kitchen 
scene can be close to the sink, fridge, kitchen door or 
top cupboards (green box in the images). Our objective 
is to learn feature representations which are robust to 
these variations by spatially shuffling the convolutional 
activations (Sec. III). 

CNNs preserve the global spatial layout in an image. 
This is desirable for classification tasks where an 
image predominantly contains only a single object (e.g., 
objects in the ImageNet database [32]). However, for a high 
level vision task such as indoor scene classification, 
an image may contain multiple distinct objects across 
different spatial locations. We therefore want to devise 
a method of feature description which is robust with 
respect to the spatial layout of objects in a scene. 
Although the commonly used local pooling layers (max or 
mean pooling) in standard CNN architectures have been 
shown to achieve viewpoint and pose invariance to some 
extent [9], [14], these layers cannot accommodate the large 
scale deformations that are caused by spatial layout 
variations in indoor scenes. In order to achieve spatial 
layout invariance, this paper introduces a modified CNN 
architecture with an additional layer, termed the ‘spatially 
unstructured layer’ (Sec. III-A). The proposed CNN is 
then trained with images of indoor scenes (using our 
proposed strategy described in Sec. III-B) and the learnt 
feature representations are invariant to the spatial layout 
of the constituent objects. 

Training a deep CNN requires a large amount of 
data because the number of parameters to be learnt is 
very large. However, for indoor scenes, only a limited 
amount of annotated training data is available. This 
becomes a serious limitation for the 
feasible training of a deep CNN. Some recently proposed 
techniques demonstrate that CNN models pre-trained on 
large datasets (e.g., ImageNet) can be adapted for similar 
tasks with limited additional training data. However, 
cross domain adaptation becomes problematic in the case 
of heterogeneous tasks due to the different natures of the 
source and target datasets. For example, an image in 
the ImageNet dataset contains mostly centered objects 
belonging to only one class. In contrast, an image in 
an indoor scene dataset has many constituent objects, all 
appearing in a variety of layouts and scales. In this work, 
we propose an efficient strategy to achieve cross domain 
adaptation with only a limited number of annotated 
training images in the target dataset (Sec. III-B). 

The major contributions of this paper can be summarized 
as: 1) A new method of feature description (using 
the activations of a deep convolutional neural network) 
is proposed to deal with the large-scale spatial layout 
deformations in scene images (Sec. III-A), 2) A pyramidal 
image representation is proposed to achieve scale invariance 
(Sec. III-C), 3) A novel transfer learning approach 
is introduced to efficiently adapt a pre-trained network 
model (on a large dataset) to any target classification 
task with only a small amount of available annotated 
training data (Sec. III-B) and 4) Extensive experiments 
are performed to validate the proposed approach. Our 
results show a significant performance improvement for 
the challenging indoor scene classification task on a 
number of datasets. 

II. Related Work 

Indoor scene classification has been actively researched 
and a number of methods have been developed 
in recent years [16], [31], [37], [51]. 
While some of these methods focus on the holistic 
properties of scene images (e.g., CENTRIST [45], Gist 
descriptor [26]), others give more importance to the local 
distinctive aspects (e.g., dense SIFT [16], HOG [46]). 
In this paper, we argue that we cannot rely on either 
the local or the holistic image characteristics alone to describe 
all indoor scene types [30]. For some scene types, 
holistic or global image characteristics are enough (e.g., 
corridor), while for others, local image properties must 
be considered (e.g., bedroom, shop). We therefore focus 
on neither the global nor the local feature description and 
instead extract mid-level image patches to encode an 
intermediate level of information. Further, we propose a 
pyramidal image representation which is able to capture 
the discriminative aspects of indoor scenes at multiple 
levels. 

Recently, mid-level representations have emerged as 
a competitive candidate for indoor scene classification. 
Strategies have been devised to discover discriminative 
mid-level image patches which are then encoded by a 
feature descriptor. For example, several works (e.g., [38]) 
learn to discover discriminative patches from the 
training data. Our proposed method can also be categorized 
as a mid-level image patch based approach. 
However, our method is different from previous methods, 
which require discriminative patch ranking and selection 
procedures or involve the learning of distinctive 
primitives. In contrast, our method achieves state of the 
art performance by simply extracting mid-level patches 
densely and uniformly from an image (see more details 
in Sec. III-D). 

An open problem in indoor scene classification is the 
design of feature descriptors which are robust to global 
layout deformations. The initial efforts to resolve this 
problem used bag-of-visual-words models or variants 
(e.g., [16], [47]), which are based on locally invariant 
descriptors, e.g., SIFT [22]. Recently, these local feature 
representations have been outperformed by learned 
feature representations from deep neural networks [14], 
[31], [32]. However, since there is no inherent mechanism 
in these deep networks to deal with the high 
variability of indoor scenes, several recent efforts have 
been made to fill this gap (e.g., [7], [9]). The bag 
of features approach of Gong et al. [7] performs VLAD 
pooling of CNN activations. Another example is 
the combination of spatial pyramid matching and CNNs 
(proposed by He et al. [9]) to increase the robustness 
of the features. These methods, however, devise feature 
representations on top of CNN activations and do not inherently 
equip the deep architectures to effectively deal with 
large deformations. In contrast, this work provides 
an alternative strategy based on an improved network 
architecture to enhance invariance towards large scale 
deformations. The detailed description of our proposed 
feature representation method is presented next. 


III. Proposed Spatial Layout and Scale 
Invariant Convolutional Activations - S²ICA 

The block diagram of our proposed Spatial Layout and 
Scale Invariant Convolutional Activations (S²ICA) based 
feature description method is presented in Fig. 2. The 
detailed description of each of the blocks is given here. 
We first present our baseline CNN architecture followed 
by a detailed description of our spatially unstructured 
layer in Sec. III-A. Note that the spatially unstructured 
layer is introduced to achieve invariance to large scale 
spatial deformations, which are commonly encountered 
in images of indoor scenes. The baseline CNN architecture 
is pre-trained for a large scale classification task. A 
novel method is then proposed to adapt this pre-trained 
network for the specific task of scene categorization 
(Sec. III-B). Due to the data hungry nature of CNNs, it is 
not feasible to train a deep architecture with only a limited 
amount of available training data. For this purpose, 
we pre-train a ‘TransferNet’, which is then appended 
to the initialized CNN, and the whole network can then 
be efficiently fine-tuned for the scene classification task. 
Convolutional activations from this fine-tuned network 
are then used for a robust feature representation of the 
input images. To deal with the scale variations, we 
propose a pyramidal image representation and combine 
the activations from multiple levels, which results in a 
scale invariant feature representation (Sec. III-C). This 
representation is then finally used by a linear Support 
Vector Machine (SVM) for classification (Sec. III-D). 


A. CNN Architecture 

Our baseline CNN architecture is presented in Fig. 3. 
It consists of five convolutional layers and four fully 
connected layers. The architecture of our baseline CNN 
is similar to AlexNet [14]. The main differences are that 
we introduce an extra fully connected layer, and that all of 
our neighboring layers are densely connected (in contrast 
to the sparse connections in AlexNet). To achieve spatial 
layout invariance, the architecture of the baseline CNN 
is modified and a new unstructured layer is added after 
the first sub-sampling layer. A brief description of each 
layer of the network follows next. 

Let us suppose that the convolutional neural network 
consists of L hidden layers and each layer is indexed by 
$l \in \{1, \ldots, L\}$. The feed-forward pass can be described 
as a sequence of convolution, optional sub-sampling and 
normalization operations. The response of each convolution 
node in layer l is given by: 


$$a^l_n = f\left(\sum_m \left(a^{l-1}_m * k^l_{m,n}\right) + b^l_n\right), \tag{1}$$

where k and b denote the learned kernel and bias, and the 
indices (m, n) indicate that the mapping is from the m-th 
feature map of the previous layer to the n-th feature map 
of the current layer. The function f is the element-wise 
Rectified Linear Unit (ReLU) activation function. The 
response of each normalization layer is given by: 


$$a^l_n = \frac{a^{l-1}_n}{\left(\alpha + \beta \sum_{j=\max(0,\, n-\sigma)}^{\min(N-1,\, n+\sigma)} \left(a^{l-1}_j\right)^2\right)^{\gamma}}, \tag{2}$$


where $\alpha$, $\beta$, $\gamma$ and $\sigma$ are constants¹ and N is the total number 
of kernels in the layer. 

¹ These constants are defined as in [14]: $\alpha = 2$, $\beta = 10^{-4}$, $\gamma = 3/4$ 
and $\sigma = 5/2$. 

Fig. 2: Overview of the proposed Spatial Layout and Scale Invariant Convolutional Activations (S²ICA) based 
feature description method. Mid-level patches are extracted from three levels (A, B, C) of the pyramidal image 
representation. The extracted patches are separately fed forward through the two trained CNNs (with and without 
the spatially unstructured layer). The convolutional activations based feature representation of the patches is then 
pooled and a single feature vector for the image is finally generated by concatenating the feature vectors from both 
CNNs. Figure best seen in color. 

The response of each sub-sampling node is given by: 

$$a^l_n = k^l_n \max_{T \times T}\left(a^{l-1}_n\right), \tag{3}$$

where $k^l_n$ is the connection weight and T is the 
neighborhood size over which the values are pooled. 
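
As an illustration of the normalization response in Eq. (2), the following minimal Python sketch normalizes the responses of N kernels at a single spatial location. The function name `local_response_norm` is hypothetical, and using an integer half-window of 2 in place of the stated value of 5/2 is an assumption made so that the window bounds are valid indices.

```python
def local_response_norm(a, alpha=2.0, beta=1e-4, gamma=0.75, sigma=2):
    """Normalize responses across feature maps at one spatial location.

    a: list of N responses from the previous layer, one per kernel.
    sigma: half-width of the window over neighbouring kernels
           (assumption: integer part of the 5/2 given in the footnote).
    """
    n_maps = len(a)
    out = []
    for n in range(n_maps):
        lo = max(0, n - sigma)
        hi = min(n_maps - 1, n + sigma)
        # Sum of squared responses of the neighbouring kernels.
        energy = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[n] / (alpha + beta * energy) ** gamma)
    return out
```

In a full network the same operation would be applied independently at every spatial position of the feature maps.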

In our proposed modified CNN architecture, a spatially 
unstructured layer follows the first sub-sampling 
layer and breaks the spatial order of the output feature 
maps. This helps in the generation of robust feature 
representations that can cope with the high variability 
of indoor scenes. For each feature response, we split the 
feature map into a specified number of blocks (n). Next, 
a matrix U is constructed whose elements correspond to 
the scope of each block defined as a tuple: 


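$$U = \{u_i \mid u_i = (p, q)\}, \tag{4}$$

where p and q indicate the starting and ending index 
of each block. To perform a local swapping operation, 
we define a matrix S in terms of an identity matrix I as 
follows: 

$$S_{2\times 2} = |I - 1| = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}. \tag{5}$$

Next, a transformation matrix $T \in \mathbb{R}^{\sqrt{n}\times\sqrt{n}}$ is defined 
in terms of S as follows: 

$$T_{\sqrt{n}\times\sqrt{n}} = \begin{pmatrix} S & 0 & \cdots & 0 \\ 0 & S & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & S \end{pmatrix}_{\sqrt{n}/2 \times \sqrt{n}/2}. \tag{6}$$
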


The transformation matrix T has the following properties: 

• $T = \{t_{ij}\}$ is a permutation matrix ($T : \{u_{ij}\} \rightarrow \{u_{ij}\}$) 
since the sum along each row and column 
is always equal to one, i.e., $\sum_i t_{ij} = \sum_j t_{ij} = 1$. 

• T is a bistochastic matrix and therefore, according 
to the Birkhoff-von Neumann theorem and the above 
property, T lies on the convex hull of the set of 
bistochastic matrices. 

• It is a binary matrix with entries belonging to the 
Boolean domain {0, 1}. 

• It is an orthogonal matrix, i.e., $TT^t = I$ and $T^{-1} = T^t$. 
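
The listed properties can be checked numerically on a small case. The sketch below, in plain Python, builds S as |I - 1| and T as a block-diagonal arrangement of S for a 4 x 4 example, then tests the row and column sums, the binary entries and orthogonality. The helper names (`swap_matrix`, `transformation_matrix`, `matmul`, `transpose`) are hypothetical, and the integer block identifiers stored in `U` stand in for the (p, q) tuples of the text.

```python
def swap_matrix():
    """S = |I - 1| for the 2x2 case: exchanges two neighbouring entries."""
    eye = [[1, 0], [0, 1]]
    return [[abs(eye[i][j] - 1) for j in range(2)] for i in range(2)]


def transformation_matrix(size):
    """Block-diagonal matrix with size/2 copies of S on its diagonal."""
    if size % 2:
        raise ValueError("size must be even")
    s = swap_matrix()
    t = [[0] * size for _ in range(size)]
    for b in range(0, size, 2):
        for i in range(2):
            for j in range(2):
                t[b + i][b + j] = s[i][j]
    return t


def matmul(x, y):
    """Plain matrix product of two lists of lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def transpose(x):
    return [list(row) for row in zip(*x)]


T = transformation_matrix(4)
identity = [[int(i == j) for j in range(4)] for i in range(4)]

rows_sum_to_one = all(sum(row) == 1 for row in T)
cols_sum_to_one = all(sum(col) == 1 for col in zip(*T))
is_binary = all(v in (0, 1) for row in T for v in row)
is_orthogonal = matmul(T, transpose(T)) == identity

# Integer block ids (assumption: stand-ins for the (p, q) tuples in U).
U = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
U_new = matmul(matmul(transpose(T), U), T)  # T^t U T
```

Applying T on both sides exchanges neighbouring rows and neighbouring columns of the index matrix, which is the local swap that S encodes.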

Using the matrix T, we transform U to become: 

$$\hat{U} = (U^t T)^t T = T^t U T. \tag{7}$$

The updated matrix $\hat{U}$ contains the new indices of the 
modified feature maps. If $\mathcal{Y}(\cdot)$ is a function which reads 
the indices of the blocks stored in the form of tuples in 
matrix $\hat{U}$, the layer outputs are as follows: 

$$a^l_n = r * \mathcal{Y}\left(a^{l-1}_n, \hat{U}\right), \tag{8}$$

$$\text{where } r \sim \text{Bernoulli}(p). \tag{9}$$

r is a random variable which has a probability p of being 
equal to 1. Note that this shuffling operation is applied 
randomly so that a network does not get biased towards 
the normal patches. Fig. 4 illustrates the distortion operation 
performed by the spatially unstructured layer for 
different numbers of blocks. 
Fig. 3: The architecture of our proposed Convolutional Neural Network used to learn tailored feature representations 
for scene categorization. We devise a strategy (see Sec. III-B and Alg. 2) to effectively adapt the learned feature 
representation from a large scale classification task to scene categorization. Key: C: Convolution Layer; 
FC: Fully Connected Layer; N: Normalization Layer; P: Max-pooling Layer; R: Rectified Linear Unit; 
SM: Soft-max Layer; SU: Spatially Unstructured Layer. 


Algorithm 1 Operations Involved in Spatially Unstructured Layer 

Input: Feature map: F ∈ R^(p×q×r×s), Number of Blocks: n 
       // F is a real valued four dimensional matrix 
Output: Modified feature map (F_m) 

h_pts = (√n + 1) linearly spaced points in range [1 : p]    // Rearrangement level 
h_pts = h_pts rounded to integer indices 
h_pts[end] += 1 
w_pts = h_pts                                               // ∵ p = q for F 
for ∀i ∈ [1 : length(h_pts) − 1] do 
    for ∀j ∈ [1 : length(w_pts) − 1] do 
        F_tmp = F[h_pts(i) : h_pts(i+1) − 1, w_pts(j) : w_pts(j+1) − 1, :, :] 
        F_tmp = [F_tmp(⌊rows(F_tmp)/2⌋ + 1 : end, :, :, :); F_tmp(1 : ⌊rows(F_tmp)/2⌋, :, :, :)] 
        F_tmp = [F_tmp(:, ⌊cols(F_tmp)/2⌋ + 1 : end, :, :), F_tmp(:, 1 : ⌊cols(F_tmp)/2⌋, :, :)] 
        F_m[h_pts(i) : h_pts(i+1) − 1, w_pts(j) : w_pts(j+1) − 1, :, :] = F_tmp 
return {F_m} 


Fig. 4: (left to right) Original image and the spatially 
unstructured versions with 2^16, 2^14 and 2^2 blocks 
respectively. 
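
The block-wise rearrangement of Algorithm 1 can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it operates on a single square 2-D feature map stored as a list of lists (the algorithm works on a four dimensional array), block boundaries are computed with integer division, and the Bernoulli variable r is interpreted as "apply the shuffle with probability p, otherwise pass the map through unchanged". The names `block_edges`, `spatially_unstructure` and `su_layer` are hypothetical.

```python
import math
import random


def block_edges(size, n_blocks):
    """Boundaries of sqrt(n_blocks) equal blocks along one side of the map."""
    per_side = math.isqrt(n_blocks)
    if per_side * per_side != n_blocks:
        raise ValueError("n_blocks must be a perfect square")
    return [(k * size) // per_side for k in range(per_side + 1)]


def spatially_unstructure(fmap, n_blocks):
    """Swap the two halves of every block along rows, then along columns."""
    edges = block_edges(len(fmap), n_blocks)
    out = [row[:] for row in fmap]
    for r0, r1 in zip(edges, edges[1:]):
        for c0, c1 in zip(edges, edges[1:]):
            block = [row[c0:c1] for row in fmap[r0:r1]]
            half_r = len(block) // 2
            block = block[half_r:] + block[:half_r]   # bottom half first
            half_c = (c1 - c0) // 2
            block = [row[half_c:] + row[:half_c] for row in block]  # right half first
            for i, row in enumerate(block):
                out[r0 + i][c0:c1] = row
    return out


def su_layer(fmap, n_blocks, p=0.5, rng=random):
    """Shuffle with probability p (assumed reading of r ~ Bernoulli(p))."""
    if rng.random() < p:
        return spatially_unstructure(fmap, n_blocks)
    return [row[:] for row in fmap]
```

With few blocks the halves being swapped are large, so the global layout is strongly disturbed; with many blocks only local neighbourhoods are exchanged.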


B. Training CNNs for Indoor Scenes 

Deep CNNs have demonstrated exceptional feature 
representation capabilities for classification and detection 
tasks (e.g., see ILSVRC’14 Results [32]). Training 
deep CNNs, however, requires a large amount of data 
since the number of parameters to be learnt is huge. This 
requirement makes the 
training of CNNs infeasible where only a limited amount 
of annotated training data is available. In this paper, 
we propose to leverage the image representations 
learnt on a large scale classification task (such as on 
ImageNet [32]) and propose a strategy to learn tailored 
feature representations for indoor scene categorization. 
An algorithmic description of our proposed strategy is 
summarized in Algorithm 2. The details are presented 
here. 

We first train our baseline CNN architecture on the ImageNet 
database following the procedure in [14]. Next, 
we densely extract mid-level image patches from our 
scene classification training data and represent them in 
terms of the convolutional activations of the trained 
baseline network. The output of the last convolution 
layer followed by the ReLU non-linearity is considered as 
a feature representation of the extracted patches. These 
feature representations ($\mathcal{F}$) will be used to train our 
TransferNet. 

As depicted in Fig. 3, our TransferNet consists of 
three hidden layers (with 4096 neurons each) and an 
output layer, whose number of neurons is equal to 
the number of classes in the target dataset (e.g., indoor 
scenes dataset). TransferNet is trained on the convolutional 
feature representations ($\mathcal{F}$) of mid-level patches of the 
scene classification dataset. Specifically, the inputs to 
TransferNet are the feature representations ($\mathcal{F}$) of the 
patches and the outputs are their corresponding class 
labels. After training TransferNet, we remove all fully 
connected layers of the baseline CNN and join the 
trained TransferNet to the last convolutional layer of the 
baseline CNN. The resulting network then consists of 
five convolutional layers and four fully connected layers 
(of the trained TransferNet). This complete network is 
now fine-tuned on the patches extracted from the training 
images of the scene classification data. Since the network 
initialization is quite good (the convolutional layers of 
the network are initialized from the baseline network 
trained on the ImageNet dataset, whereas the fully connected 
layers are initialized from the trained TransferNet), only 
a few epochs are required for the network to converge. 
Moreover, with a good initialization, it becomes feasible 
to learn the parameters of a deep CNN even with a smaller 
number of available training images. 

Note that the baseline CNN was trained with images 
from the ImageNet database, where each image predominantly 
contains one or multiple instances of the 
same object. In the case of scene categorization, we may 
deal with multiple distinct objects from a wide range 
of poses, appearances and scales across different spatial 
locations. Therefore, in order to incorporate large scale 
deformations, we train two CNNs, one without and one with the 
spatially unstructured layer (learned weights represented 
by $W$ and $W_{su}$ respectively). These trained CNNs are 
then used for robust feature representation in Sec. III-D. 
Below, we first explain our approach in dealing with 
scale variations. 


Algorithm 2 Training CNNs for indoor scenes 

Input: Source DB (ImageNet), Target DB (Scene Images) 

Output: Learned weights: $\{W\}_{1\times L}$, $\{W_{su}\}_{1\times L}$ 

1: Pre-train the CNN on the large-scale Source DB. 

2: Feed-forward image patches from the target DB to the 
trained CNN. 

3: Take feature representations ($\mathcal{F}$) from the last 
convolution layer. 

4: Train the ‘TransferNet’ of 4 fully connected layers 
with $\mathcal{F}$ as input and target annotations as output. 

5: Append ‘TransferNet’ to the last convolution layer 
of the trained CNN. 

6: Fine-tune the complete network without and with 
the spatially unstructured layer to get $\{W\}_{1\times L}$ and 
$\{W_{su}\}_{1\times L}$ respectively. 


C. Pyramid Image Representation 

In order to achieve scale invariance, we generate a 
pyramid of an image at multiple spatial resolutions. 
However, unlike conventional pyramid generation processes 
(e.g., Gaussian or Laplacian pyramid) where 
smoothing and sub-sampling operations are repeatedly 
applied, we simply resize each image to a set of scales, 
and this may involve up or down sampling. Specifically, 
we transform each image to three scales, 
{0.75 × D, D, 1.25 × D}, where D is the smaller dimension of 
an image which is set based on the given dataset. At 
each scale, we densely extract patches which are then 
encoded in terms of the convolutional activations of the 
trained CNNs. 
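
The three-level resizing described above can be sketched as follows. The helper name `pyramid_sizes` is hypothetical, and preserving the aspect ratio while setting the smaller side to each target value is an assumption; the text only states that D is the smaller image dimension.

```python
def pyramid_sizes(width, height, base_dim, factors=(0.75, 1.0, 1.25)):
    """Return the (width, height) of each pyramid level.

    base_dim plays the role of D: the smaller image side is resized to
    factor * base_dim, and the other side is scaled by the same ratio
    (assumption: aspect ratio is preserved).
    """
    sizes = []
    for factor in factors:
        scale = factor * base_dim / min(width, height)
        sizes.append((round(width * scale), round(height * scale)))
    return sizes
```

For instance, a 640 x 480 image with a hypothetical D of 400 would be resized so that its smaller side becomes 300, 400 and 500 pixels at the three levels.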

D. Image Representation and Classification 

From each of the three images of the pyramidal 
image representation, we extract multiple overlapping 
patches of 224 × 224 using a sliding window. A shift 
of 32 pixels is used between patches. The extracted 
image patches are then fed forward through the trained 
CNNs (both with and without the spatially unstructured 
layer). The convolutional feature representations of the 
patches are max-pooled to get a single feature vector 
representation for each image of the pyramid. These are denoted by A, B and 
C, corresponding to the three images of the pyramid in Fig. 2. 
We then max pool the feature representations of these 
images and generate one single representation of the 
image for each network (with and without the spatially 
unstructured layer). The final feature representation is 
achieved by concatenating these two feature vectors. 
After encoding the spatial layout and scale invariant 
feature representations for the images, the next step is to 
perform classification. We use a simple linear Support 
Vector Machine (SVM) classifier for this purpose. 
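
The patch extraction and pooling steps above can be summarized by the following Python sketch. It assumes that patch features are already available as plain lists of numbers and that only patches lying fully inside the image are used; the names `patch_origins`, `max_pool` and `image_descriptor` are hypothetical.

```python
def patch_origins(width, height, patch=224, stride=32):
    """Top-left corners of all sliding-window patches that fit in the image."""
    xs = range(0, width - patch + 1, stride)
    ys = range(0, height - patch + 1, stride)
    return [(x, y) for y in ys for x in xs]


def max_pool(vectors):
    """Element-wise maximum over a list of equal-length feature vectors."""
    return [max(column) for column in zip(*vectors)]


def image_descriptor(levels_plain, levels_su):
    """Pool patches per level, pool the levels, then concatenate both CNNs.

    levels_plain / levels_su: for each pyramid level, the list of patch
    feature vectors from the CNN without / with the unstructured layer.
    """
    pooled_plain = max_pool([max_pool(patches) for patches in levels_plain])
    pooled_su = max_pool([max_pool(patches) for patches in levels_su])
    return pooled_plain + pooled_su
```

The resulting vector has twice the dimensionality of a single patch feature and is what a linear SVM would receive as input.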

IV. Experiments and Evaluation 

The proposed approach is validated through extensive 
experiments on a number of datasets. To this end, we perform 
experiments on three indoor scene datasets (MIT-67, 
NYU and Scene-15). Amongst these datasets, MIT-67 
is the largest dataset for indoor scene classification. 
The dataset is quite challenging since images of many 
classes are similar in appearance and thus hard to classify 
(see Fig. 8). Apart from indoor scene classification, 
we further validate our approach on two other tasks, 
i.e., object and event classification (Graz-02 and Sports-8 datasets). 
Below (Sec. IV-A), we first present a brief description 
of each of the datasets and the adopted experimental 
protocols. We then present our experimental results along 
with a comparison with the existing state of the art in 
Sec. IV-B. An ablative analysis to study the individual 
effect of each component of the proposed method is also 
presented in Sec. IV-B. 

A. Datasets 

The MIT-67 Dataset contains a total of 15620 images of 
67 indoor scene classes. For our experiments, we follow 
the standard evaluation protocol in [30]. Specifically, 100 
images per class are considered, out of which 80 are used 
for training and the remaining 20 are used for testing. 
We therefore have a total of 5360 and 1340 images for 
training and testing respectively. 

The 15 Category Scene Dataset contains images of 15 
urban and natural scene classes. The number of images 
for each scene class in the dataset ranges from 200 to 
400. For performance evaluation and comparison with the 
existing state of the art, we follow the standard evaluation 
protocol in [16], where 100 images per class are selected 
for training and the rest for testing. 

The NYU v1 Indoor Scene Dataset contains a total of 
2347 images belonging to 7 indoor scene categories. We 
follow the evaluation protocol described in [35] and use 
the first 60% of the images of each class for training and 
the last 40% of the images for testing. 

The Inria Graz 02 Dataset contains a total of 1096 
images of three classes (bikes, cars and people). The 
images of this dataset exhibit a wide range of appearance 
variations in the form of heavy clutter, occlusions and 
pose changes. The evaluation protocol defined in [24] is 
used in our experiments. Specifically, the training and 
testing splits are generated by considering the first 150 
odd images for training and the first 150 even images 
for testing. 


The UIUC Sports Event Dataset contains 1574 images 
of 8 sports event categories. Following the protocol 
defined in [17], we used 70 and 60 randomly sampled 
images per category for training and testing respectively. 


B. Results and Analysis 

The quantitative results of the proposed method in 
terms of classification rates for the task of indoor scene 
categorization are presented in Tables I, III and V. A 
comparison with the existing state of the art techniques 
shows that the proposed method consistently achieves a 
superior performance on all datasets. We also evaluate 
the proposed method for the tasks of sports events and 
highly occluded object classification (Tables II and IV). 
The results show that the proposed method achieves 
very high classification rates. The experimental results 
suggest that the gain in performance of our method 
is more significant and pronounced for the MIT-67, 
Scene-15, Graz-02 and Sports-8 datasets. The confusion 
matrices showing the class wise accuracies for the Scene-15, 
Sports-8 and NYU datasets are presented in Fig. 6. The 
confusion matrix for the MIT-67 scene dataset is given 
in Fig. 5. It can be noted that all the confusion matrices 
have a very strong diagonal (Fig. 5 and Fig. 6). The 
majority of the confused testing samples belong to very 
closely related classes, e.g., living room is confused with 
bedroom, office with computer-room, coast with open-country 
and croquet with bocce. 

The superior performance of our method is attributed 
to its ability to handle large spatial layout variations (through 
the introduction of the spatially unstructured layer in 
our modified CNN architecture) and scale variations 
(achieved by the proposed pyramidal image representation). 
Further, our method is based on deep convolutional 
representations, which have recently been shown to 
be superior in performance to shallow handcrafted 
feature representations [9], [31], [32]. A number of the 
compared methods are based upon mid-level feature 
representations (e.g., [38]). Our results show 
that our proposed method achieves superior performance 
over these methods. It should be noted that in contrast to 
existing mid-level feature representation based methods 
(whose main focus is on the automatic discovery of 
discriminative mid-level patches) our method simply 
densely extracts mid-level patches from uniform locations 
across an image. This is computationally very 
efficient since we do not need to devise patch selection 
and sorting strategies. Further, our dense patch extraction 
is similar to dense keypoint extraction, which has shown 
a comparable performance with sophisticated keypoint 
extraction methods over a number of classification tasks. 

Fig. 6: Confusion matrices for Scene-15, Sports-8 and NYU scene classification datasets. Figure best seen in color. 


MIT-67 Indoor Scenes Dataset 

Method                                        Accuracy (%) 
ROI + GIST [CVPR’09] [30]                     26.1 
MM-Scene [NIPS’10] [50]                       28.3 
SPM [CVPR’06] [16]                            34.4 
Object Bank [NIPS’10] [18]                    37.6 
RBoW [CVPR’12] [29]                           37.9 
Weakly Supervised DPM [ICCV’11] [28]          43.1 
SPMSM [ECCV’12] [15]                          44.0 
LPR-LIN [ECCV’12]                             44.8 
BoP [CVPR’13] [12]                            46.1 
Hybrid Parts + GIST + SP [ECCV’12] [49]       47.2 
OTC [ECCV’14] [23]                            47.3 
Discriminative Patches [ECCV’12] [37]         49.4 
ISPR [CVPR’14] [51]                           50.1 
D-Parts [ICCV’13]                             51.4 
VC + VQ [CVPR’13] [20]                        52.3 
IFV [CVPR’13] [12]                            60.8 
MLRep [NIPS’13] [4]                           64.0 
CNN-MOP [ECCV’14] [7]                         68.9 
CNNaug-SVM [CVPRw’14] [31]                    69.0 
Proposed S²ICA                                71.2 

TABLE I: Mean accuracy on the MIT-67 indoor scenes dataset. 
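
As a worked check of the relative improvement quoted for MIT-67, the figure can be reproduced from the accuracies in Table I. Taking the best previously reported accuracy (CNNaug-SVM, 69.0%) as the reference is an assumption about how the improvement was computed, although it does reproduce the stated 3.2%:

```latex
\frac{71.2 - 69.0}{69.0} \times 100\% = \frac{2.2}{69.0} \times 100\% \approx 3.2\%
```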
UIUC Sports-8 Dataset 

Method                                        Accuracy (%) 
GIST-color [IJCV’01] [26]                     70.7 
MM-Scene [NIPS’10] [50]                       71.7 
Graphical Model [ICCV’07]                     73.4 
Object Bank [NIPS’10] [18]                    76.3 
Object Attributes [ECCV’12]                   77.9 
CENTRIST [PAMI’11] [45]                       78.2 
RSP [ECCV’12] [11] 
SPM [CVPR’06] [16]                            81.8 
SPMSM [ECCV’12] [15]                          83.0 
Classemes [ECCV’10]                           84.2 
HIK [ICCV’09]                                 84.2 
LScSPM [CVPR’10] [6]                          85.3 
LPR-RBF [ECCV’12]                             86.2 
Hybrid Parts + GIST + SP [ECCV’12] [49]       87.2 
LCSR [CVPR’12] [34]                           87.2 
VC + VQ [CVPR’13] [20]                        88.4 
IFV [43]                                      90.8 
ISPR [CVPR’14] [51]                           89.5 
Proposed S²ICA                                95.8 

TABLE II: Mean accuracy on the UIUC Sports-8 dataset. 


Graz-02 Dataset 

                                  Accuracy (%) 
Method                     Cars    People    Bikes    Overall 
OLB [SCIA’05] [27]         70.7    81.0      76.5     76.1 
VQ [ICCV’07]               80.2    85.2      89.5     85.0 
ERC-F [PAMI’08] [25]       79.9    -         84.4     82.1 
TSD-IB [BMVC’11] [13]      87.5    85.3      91.2     88.0 
TSD-k [BMVC’11] [13]       84.8    87.3      90.7     87.6 
Proposed S²ICA             98.7    97.7      97.7     98.0 

TABLE IV: Equal Error Rates (EER) on Graz-02 dataset. 



Fig. 7: The contributions (red: most; blue: least) of mid-level
patches towards correct class prediction. Best seen
in color.


NYU Indoor Scenes Dataset

Method                            Accuracy (%)
BoW-SIFT [ICCVw’11] [35]          55.2
RGB-LLC [TC’13] [40]              78.1
RGB-LLC-RPSL [TC’13] [40]         79.5
Proposed S²ICA                    81.2


TABLE III: Mean accuracy for the NYU v1 dataset.


The contributions of the extracted mid-level patches
towards a correct classification are shown in the form of
heat maps for some example images in Fig. 7. It can be
seen that our proposed feature descriptor, based on spatial
layout and scale invariant convolutional activations,
automatically gives more importance to the meaningful and
information-rich parts of an image.

The actual and predicted labels of some misclassified
images from the MIT-67 dataset are shown in Fig. 8. Note the
extremely challenging nature of the images in the presence
of high inter-class similarities. Some of the classes
are very challenging, with no visual indication that
determines the actual label. It can be seen that the
misclassified images belong to highly confusing and very
similar looking scene types. For example, the image of
inside subway is misclassified as inside bus, library
as bookstore, movie theater as auditorium and office as
classroom.

An ablative analysis to assess the effect of each
individual component of the proposed technique on
the overall performance is presented in Table VI. Specifically,
the contributions of the proposed spatially unstructured
layer, the pyramidal image representation, training of the
CNN on the target dataset and pooling (mean pooling
and max pooling) are investigated. To investigate
a specific component of the proposed method, we only
modify (add or remove) that part, while the rest of
the pipeline is kept fixed. The experimental results in
Table VI show that the feature representations from
trained CNNs with and without the spatially unstructured
layer complement each other and together achieve the best
performance (71.2%, against 65.4% and 65.9% individually).
Furthermore, the proposed pyramidal image
representation also contributes towards the
performance of the proposed method, raising accuracy
from 68.5% to 71.2%. Our
proposed strategy to adapt a deep CNN (trained on a
large scale classification task) for scene categorization
also proves to be very effective, improving accuracy
from 67.3% to 71.2%. Amongst the pooling
strategies, max pooling (71.2%) provides superior performance
compared with mean pooling (65.7%).
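
The two pooling strategies compared above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the function names are hypothetical, patch features are plain lists of floats, and concatenation is assumed as the way the features from the two networks (with and without the spatially unstructured layer) are joined.

```python
def pool_patches(patch_features, mode="max"):
    """Pool per-patch feature vectors into one image-level vector.

    `patch_features` is a non-empty list of equal-length lists,
    one per mid-level patch (an assumed representation).
    """
    if not patch_features:
        raise ValueError("at least one patch is required")
    columns = list(zip(*patch_features))  # one tuple per feature dimension
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    raise ValueError("mode must be 'max' or 'mean'")


def image_descriptor(baseline_feats, modified_feats, mode="max"):
    """Join pooled features from both networks (concatenation is assumed)."""
    return pool_patches(baseline_feats, mode) + pool_patches(modified_feats, mode)


if __name__ == "__main__":
    patches = [[1.0, 4.0], [3.0, 2.0]]
    print(pool_patches(patches, "max"))   # prints [3.0, 4.0]
    print(pool_patches(patches, "mean"))  # prints [2.0, 3.0]
```

Max pooling keeps the strongest response per dimension across all patches, whereas mean pooling averages strong responses with those of uninformative patches.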

V. Conclusion 

This paper proposed a novel approach to handle the
large-scale deformations caused by spatial layout and
scale variations in indoor scenes. A pyramidal image
representation has been devised to deal with scale
variations. A modified Convolutional Neural Network
architecture with an added layer has been introduced to
deal with the variations caused by spatial layout changes.
In order to feasibly train a CNN on tasks with only a
limited amount of annotated training data, the paper proposed an
efficient strategy which conveniently transfers learning
15 Category Scene Dataset

Method                                        Accuracy (%)
GIST-color [IJCV’01] [26]                     69.5
RBoW [CVPR’12] [29]                           78.6
Classemes [ECCV’10]                           80.6
Object Bank [NIPS’10] [18]                    80.9
SPM [CVPR’06] [16]                            81.4
SPMSM [ECCV’12] [15]                          82.3
LCSR [CVPR’12] [34]                           82.7
SP-pLSA [PAMI’08]                             83.7
CENTRIST [PAMI’11] [45]                       83.9
HIK [ICCV’09] [44]                            84.1
OTC [ECCV’14] [23]                            84.4
ISPR [CVPR’14] [21]                           85.1
VC + VQ [CVPR’13] [20]                        85.4
LMLF [CVPR’10]                                85.6
LPR-RBF [ECCV’12]                             85.8
Hybrid Parts + GIST + SP [ECCV’12] [49]       86.3
CENTRIST+LCC+Boosting [CVPR’11]               87.8
RSP [ECCV’12] [11]                            88.1
IFV [43]                                      89.2
LScSPM [CVPR’10] [6]                          89.7
Proposed S²ICA                                93.1


TABLE V: Mean accuracy on the 15 Category scene dataset. Comparisons with the previous best techniques are
also shown.



Actual: Airport Inside, Pred: Prison Cell
Actual: Gameroom, Pred: Pool Inside
Actual: Airport Inside, Pred: Auditorium
Actual: Museum, Pred: Train station
Actual: Inside Subway, Pred: Inside Bus
Actual: Kindergarten, Pred: Gameroom
Actual: Office, Pred: Classroom
Actual: Movietheatre, Pred: Auditorium
Actual: Mall, Pred: Airport Inside
Actual: Livingroom, Pred: Waitingroom
Actual: Library, Pred: Bookstore
Actual: Airport Inside, Pred: Lobby


Fig. 8: Some examples of misclassified images from the MIT-67 indoor scenes dataset. Actual and predicted labels of
each image are given. Images from highly similar looking classes are confused with each other. For example,
the proposed method misclassifies library as bookstore, office as classroom and inside subway as inside bus.


Configuration                                       Accuracy
Baseline CNN (w/o Spatially Unstructured layer)     65.4%
Modified CNN (with Spatially Unstructured layer)    65.9%
Baseline CNN + Modified CNN                         71.2%

w/o pyramidal representation                        68.5%
with pyramidal representation                       71.2%

CNN trained on ImageNet                             67.3%
CNN trained on ImageNet+MIT-67                      71.2%

Mean-pooling                                        65.7%
Max-pooling                                         71.2%


TABLE VI: Ablative analysis on MIT-67 dataset. The
joint feature representations from baseline and modified
CNNs give the best performance. The proposed
pyramidal image representation results in a significant
performance boost.


from a large scale dataset. A robust feature representation
of an image is then achieved by extracting mid-level
patches and encoding them in terms of the convolutional
activations of the trained networks. Leveraging the
proposed spatial layout and scale invariant image
representation, state-of-the-art classification performance has
been achieved by using a simple linear SVM classifier.